Guidance on Planning and Implementing
Computer System Reliability


Lynne  S.  Rosenthal 


Computer  systems  have  become  an  integral  part  of  most 
organizations.  The  need  to  provide  continuous,  correct  service 
is becoming more critical. However, decentralization of
computing, inexperienced users, and larger, more complex systems
create operational environments in which it is difficult to
provide continuous, correct service. This document is intended for
the  computer  system  manager  (or  user)  responsible  for  the 
specification,  measurement,  evaluation,  selection  or  management 
of  a  computer  system. 

This report addresses the concepts and concerns associated with
computer  system  reliability.  Its  main  purpose  is  to  assist 
system  managers  in  acquiring  a  basic  understanding  of  computer 
system  reliability  and  to  suggest  actions  and  procedures  which 
can  help  them  establish  and  maintain  a  reliability  program. 
The  report  presents  discussions  on  quantifying  reliability  and 
assessing  the  quality  of  the  computer  system.  Design  and 
implementation  techniques  that  may  be  used  to  improve  the 
reliability  of  the  system  are  also  discussed.  Emphasis  is  placed 
on  understanding  the  need  for  reliability  and  the  elements  and 
activities  that  are  involved  in  implementing  a  reliability 
program. 

Key words: computer system reliability; quantification of
reliability; recovery strategy; reliability; reliability program;
reliability requirements; reliability techniques.
1.0  INTRODUCTION

Computer  systems  have  become  an  integral  part  of  most 
organizations.  As  the  computer  system  and  its  services  become  more 
essential  to  the  success  of  these  organizations,  the  ability  of  the 
system  to  process  information  correctly  and  to  provide  continuous 
service  becomes  even  more  critical.  However,  recent  trends  such  as 
decentralization  of  computing,  inexperienced  users,  and  larger,  more 
complex  systems  have  produced  operational  environments  which  thwart 
the  attainment  of  these  goals.  Rising  repair  costs  also  make  the 
reliability  of  the  general  purpose  computer  system  an  important  issue. 

Historically,  computing  has  been  dominated  by  the  large  general 
purpose  mainframe.  Associated  with  this  type  of  computer  system  is  a 
certain  set  of  reliability  questions  and  answers.  Although  this 
environment  is  changing  with  the  advent  of  microcomputers,  distributed 
data  processing,  and  distributed  data  bases,  many  of  the  reliability 
concerns  remain  the  same.  Whatever  the  system  configuration, 
reliability  continues  to  be  an  important  aspect  of  the  computer 
system.


1.1  Purpose

This  report  is  intended  to  assist  users  in  acquiring  a  basic 
understanding  of  computer  system  reliability  and  to  identify  areas  for 
further  examination.  It  presents  an  overview  of  the  fundamental 
concepts  and  concerns  associated  with  system  reliability,  and 
identifies  elements  and  activities  involved  in  planning  and 
implementing  a  reliability  program.  The  report  provides  general 
guidance  and  as  such  does  not  present  an  in-depth  methodology  for 
creating  or  maintaining  a  reliable  computer  system.  The  underlying 
theme  is  that  a  knowledge  of  reliability  is  important  in  the 
development  of  new  system  specifications  as  well  as  in  the  continual 
assessment of existing computer systems.


1.2  Scope

We  are  concerned  here  with  the  services  and  facilities  of  a 
general  purpose  computer  system  in  a  multiuser  environment.  In 
general,  this  report  can  be  applied  to  other  types  of  computer  systems 
(e.g., minicomputers, microcomputers, distributed systems). The
size of the system, its complexity, and mission within an organization
will  dictate  the  set  of  applicable  guidance.  The  system  planner  (see 
definition,  section  1.3)  should  analyze  and  evaluate  this  report  with 
respect  to  the  system  and  apply  the  appropriate  subset. 

The reliability of a computer system will be discussed in terms
of a system's three major components: hardware, software, and the
human/machine interface (the human component). An in-depth discussion
of any one component is beyond the scope of this document. Sources of
additional reliability information in these areas can be obtained from
Appendix A: Bibliography and Related Readings. Included are several
publications prepared by the National Bureau of Standards' Institute
for  Computer  Sciences  and  Technology  in  the  areas  of  hardware  and 
software  reliability,  verification  and  validation  techniques,  and  risk 
assessment.  References  to  these  and  other  documents  will  help  to 
ensure  a  complete  understanding  of  reliability  and  its  related  areas. 


1.3  Audience

The  report  is  designed  primarily  for  use  by  those  who  are 
responsible  for  the  management,  specification,  measurement, 
evaluation, or selection of a computer system. The information
presented may also be relevant to individuals who are associated with
the data processing center as system programmers, analysts,
technicians,  and/or  users.  These  employees  may  require  knowledge  of 
reliability  issues  to  facilitate  the  management  of  their  areas  of 
responsibility  and  the  performance  of  their  assigned  tasks.  The  term 
"system  planner"  will  be  used  as  a  convenient  title  for  any  of  the 
person(s)  described  above. 


1.4     Document  Overview 

The document is divided into several sections and an appendix.
It builds upon the concepts set forth in the early sections and
concludes with a discussion of a system reliability program.


Section  2  defines  reliability  and  related  terms,  and  addresses  the 
importance  of  reliability  to  the  system. 

Section  3  describes  procedures  for  the  quantification  of  the 
system  reliability,  in  particular,  sources  of  data,  metrics,  and 
the  assessment  of  the  metrics. 

Section  4  presents  a  general  discussion  of  the  reliability 
techniques  that  can  be  designed  into  computer  systems  or  can  be 
implemented  by  the  system  planner. 

Section  5  deals  with  the  recovery  of  a  computer  system  after  it 
has  failed  or  produced  erroneous  output. 

Section  6  summarizes  the  important  management  tasks,  activities, 
and  costs  involved  in  implementing  a  reliability  program. 

Appendix  A  contains  a  list  of  references  and  selected  readings. 
2.0  FUNDAMENTAL  CONCEPTS

2.1  Terminology 


The  term  system  is  used  to  denote  a  collection  of  interconnected 
components  designed  to  perform  a  set  of  particular  functions.  In 
applying  this  definition,  any  component  of  the  system  may  itself  be 
regarded  as  a  system;  e.g.  the  central  processing  unit,  memory, 
communications  to  and  from  the  system,  software  programs,  and  computer 
system  users  [McDE80]. 

For  purposes  of  this  report,  the  failure  of  the  computer  system 
will  refer  to  the  termination,  disruption,  corruption,  or  incorrect 
outcome  of  system  components  (e.g.     hardware,  software). 

The reliability of a computer system is defined as the
probability  that  the  system  will  be  able  to  process  work  correctly  and 
completely  without  its  being  aborted  or  corrupted.  Note,  a  system 
does  not  have  to  fail  (crash)  for  it  to  be  unreliable.  The  computer 
configuration,  the  individual  components  comprising  the  system,  and 
the  system's  usage  are  all  contributing  factors  to  the  overall  system 
reliability.  As  the  reliability  of  these  elements  varies,  so  will  the 
level  of  system  reliability. 

The  availability  of  a  system  is  a  measure  of  the  amount  of  time 
that  the  system  is  actually  capable  of  accepting  and  performing  a 
user's  work.  The  terms  reliability  and  availability  are  closely 
related  and  often  used  (although  incorrectly)  as  synonyms.  For 
example,  a  system  which  fails  frequently,  but  is  restarted  quickly 
would  have  high  availability,  even  though  its  reliability  is  low 
[McDE80]. To distinguish between the two, reliability can be thought
of  as  the  quality  of  service  and  availability  as  the  quantity  of 
service.  Throughout  this  guideline,  availability  will  be  viewed  as  a 
component  of  reliability. 
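
The distinction can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: the figures (a 720-hour month, 60 failures, 3 minutes per restart) are hypothetical, and mean time between failures is used here as one common stand-in for reliability, not as the definition adopted by this report.

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of total time the system could accept and perform work."""
    return uptime_hours / (uptime_hours + downtime_hours)


def mean_time_between_failures(uptime_hours, failures):
    """Average operating hours between failures (assumed reliability indicator)."""
    return uptime_hours / failures


# Hypothetical month: 720 hours, 60 failures, each restart takes 3 minutes.
failures = 60
downtime = failures * 3 / 60      # 3.0 hours lost in total
uptime = 720 - downtime           # 717.0 hours in service

print(round(availability(uptime, downtime), 4))                 # 0.9958
print(round(mean_time_between_failures(uptime, failures), 2))   # 11.95
```

In this hypothetical case the system is available more than 99 percent of the time, yet a user's work is interrupted roughly every 12 hours: high availability, low reliability.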


2.2     Reliability  Distinctions 

A  computer  system  consists  of  a  combination  of  hardware, 
software,  and  human  components,  each  of  which  can  cease  functioning 
correctly,  cause  another  component  to  malfunction,  or  help  to  increase 
the  reliability  of  the  other  components.  The  reliability  of  this 
combination  of  components  can  be  thought  of  as  computer  system 
reliability.  Traditionally,  the  reliability  of  the  computer  system 
has  concentrated  on  the  hardware  aspect  of  the  system.  This  approach 
leads  to  the  assumption  that  the  software  is  100  percent  reliable. 
Since  this  is  unlikely,  it  is  necessary  to  include  software  in  the 
system reliability calculations [ONEI83]. Finally, the need for human
interaction with a computer system (to detect and correct problems,
restart the system, input key information, etc.) makes the inclusion
of the human component in the determination of system reliability a
necessity.


2.3     Reliability  Requirements 

The  computer  system  (application)  objectives  and  the  environment 
in  which  it  operates  are  major  considerations  in  determining  the  level 
of  reliability  required  of  the  system.  An  important  question  to  be 
answered  is:     "How  reliable  must  the  system  be?" 

This  section  outlines  several  factors  that  contribute  to 
answering  this  question.  The  discussion  is  in  two  parts:  operational 
criteria  -  factors  associated  with  the  operational  setting  of  the 
system  and  which  are  affected  by  the  reliability  of  the  system;  and 
risk  analysis  -  a  method  for  balancing  the  degree  of  system 
reliability  against  acceptable  levels  of  loss  due  to  a  less  reliable 
system. 


2.3.1     Operational  Criteria  - 

Operational  criteria  are  those  characteristics  of  the  system  that 
make  reliability  more  or  less  important.  The  identification  of  these 
factors  and  their  relationship  to  reliability  is  necessary  in 
evaluating  the  system  reliability.  At  least  the  following  factors 
should  be  evaluated. 

1.  Safety.  Reliability is critical to a system where there is a
potential for loss of life or health, destruction of property, or
damage to the environment.

Examples:  health  care  systems,  scheduling  of  safety  inspections, 
power  system  controls,   air  traffic  control. 


2.  Security.  Reliability  is  a  fundamental  element  to  the  security  of 
computer  systems.  A  failure  can  decrease  or  destroy  the  security 
of  the  system.  Undesirable  events  such  as  denial  of  information, 
unauthorized  disclosure  of  information,  or  loss  of  money  and 
resources can result from lack of reliability [FIPS31, FIPS73,
HOPK80, PATR78, SHAW81].

Examples:  military command and control, electronic funds transfer,
management  of  classified  information,   inventory  control. 

3.  Access.  Reliability  becomes  a  major  concern  to  systems  when  it  is 
unusually  costly  or  impossible  to  access  that  system.  Reliability 
techniques  are  used  to  minimize  the  potential  failures  that  may 
render  the  system  useless.  These  systems  are  usually  very 
expensive  with  reliability  costs  a  fraction  of  the  overall  system 
costs. 
Examples:  remotely operated/controlled systems (space shuttle,
missiles,  satellites) 

4.  Mode  of  operation.  Reliability  has  varying  levels  of  importance 
depending  on  the  mode  of  operation.  Failures  affect  real  time, 
on-line,  and  batch  applications  differently.  Real  time 
applications are immediately affected by a failure. Similarly, a
system  which  fails  while  supporting  an  on-line  application  will 
demonstrate  a  deviation  from  expected  conditions  sooner  than  the 
same  system  operating  in  exclusively  batch  mode. 

Examples:  data  management  systems,  centralized  information 
systems,  air  traffic  control,  computer  service  bureau 

5.  Organizational  dependency.  The  importance  of  reliability 
increases  as  the  organization's  dependence  on  the  computer  system 
and  its  services  becomes  more  critical  to  the  success  of  that 
organization.  A  failure  can  directly  affect  the  organization  by 
creating  delays  or  disruptions  to  production  schedules, 
administrative activities, management decision making, etc.

Examples:     all  systems 


2.3.2     Risk  Analysis  -

A  balance  between  the  application  reliability  requirements  and 
the  cost  of  designing  and  implementing  a  system  needs  to  be  evaluated. 
The  system  planner  should  be  aware  that  for  some  applications  a 
failure  and  its  recovery  may  cost  less  than  achieving  an  increase  in 
the  system  reliability  (prevention  of  a  failure).  A  risk  analysis 
approach should be used to determine the effect of a failure and its
recovery,  and  the  level  of  reliability  sufficient  for  that  system. 
The  three  key  elements  to  be  considered  in  such  an  analysis  are: 

o    The  amount  of  damage  which  can  result  from  a  failure, 

o    The  likelihood  of  such  an  event  occurring, 

o    The  cost-effectiveness  of  existing  or  potential  safeguards. 

Previous guidelines (NBS FIPS PUBS 31 and 65) deal extensively with
risk analysis for automatic data processing systems. Although these
guidelines  pertain  to  computer  security,  the  risk  analysis  techniques 
and  procedures  can  be  applied  in  the  case  of  reliability  as  well. 
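
One way to combine the three elements is an expected-loss comparison. The sketch below is a minimal illustration, assuming that expected annual loss can be approximated as damage per failure multiplied by failures per year; the function names and all dollar amounts and rates are hypothetical and are not taken from the FIPS guidelines.

```python
def expected_annual_loss(damage_per_failure, failures_per_year):
    """Damage from one failure times how often such a failure is expected."""
    return damage_per_failure * failures_per_year


def safeguard_pays_off(damage, rate_before, rate_after, annual_cost):
    """True if the loss avoided per year exceeds the safeguard's yearly cost."""
    avoided = (expected_annual_loss(damage, rate_before)
               - expected_annual_loss(damage, rate_after))
    return avoided > annual_cost


# Hypothetical: each failure costs 2,000; a safeguard cuts failures
# from 12 per year to 3 per year.
print(expected_annual_loss(2000, 12))              # 24000
print(safeguard_pays_off(2000, 12, 3, 10000))      # True
print(safeguard_pays_off(2000, 12, 3, 25000))      # False
```

The last line shows the case noted above in which accepting a failure and its recovery costs less than preventing it.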
3.0     EVALUATING  RELIABILITY

In  order  to  assure  that  the  computer  system  meets  or  exceeds 
performance  requirements,  the  system  planner  must  be  able  to  assess  or 
specify  the  reliability  of  the  computer  system.  This  section 
describes  system  reliability  data  gathering,  analysis,  and  assessment 
results.  The  data  (section  3.1)  obtained  about  the  system  are  used  as 
input  to  the  reliability  metrics  (section  3.2),  which  in  turn,  are 
used  to  derive  policies  and  performance  criteria  about  the  system's 
reliability  (section  3.3). 


3.1     Sources  Of  Reliability  Data 

Reliability  information  can  be  obtained  from  a  number  of  sources. 
The  data  can  be  derived  from  job  accounting,  system  performance,  error 
routines  that  are  part  of  the  operating  system,  diagnostic  routines, 
operator  logs,  hardware  and  software  monitors,  and  system  users.  A 
host  of  computer  performance  evaluation  tools  and  capacity  planning 
tools  can  also  be  used  to  acquire  data  about  the  system.  Information 
about  specific  performance  tools  can  be  obtained  from  CPEUG,  the 
Computer Performance Evaluation Users Group*; CMG, the Computer
Measurement Group*; and publications such as "Management Control of
EDP  Performance"  and  "EDP  Performance  Review",  both  published  by 
Applied  Computer  Research.  Information  about  capacity  planning  tools 
can  be  found  in  reference  [KELL83]. 

Whatever  the  source  of  reliability  data,  it  is  important  to  keep 
accurate,  timely,  and  complete  records.  These  records  form  the  basis 
for  assessing  the  reliability  of  the  system.  Typical  data  elements 
that  should  be  recorded  are: 

o  the  date  and  time  of  any  event  evincing  a  reliability  problem, 

o  type  of  event, 

o  amount  of  time  lost  (if  any), 

o  the  system  and  responsible  component, 

o  average  service  and  response  time  for  a  job, 

o  number  of  jobs  and  job  mix  at  a  given  time,  and 

o  the  system  resources  used  for  these  jobs. 


*  For  information  write:  CPEUG,  at  B266  Technology,  National  Bureau 
of  Standards,  Gaithersburg  MD  20899;  CMG,  11242  North  19th  Avenue, 
Phoenix,  AZ  85029 
This list is not meant to be all-inclusive, nor does every record need
to contain each of the above elements. (Examples of these data
elements and their derivations can be found in the following
sections.)
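
The data elements listed above could be held in a simple record structure. The following is only a sketch: the class name, field names, and sample values are hypothetical, and a real installation would adapt them to whatever its accounting and logging facilities actually provide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ReliabilityEvent:
    """One logged event evincing a reliability problem (hypothetical layout)."""
    occurred_at: datetime                 # date and time of the event
    event_type: str                       # e.g. "hard failure", "soft failure"
    system: str                           # system on which the event occurred
    component: str                        # responsible component
    time_lost: timedelta = timedelta(0)   # amount of time lost, if any
    jobs_active: Optional[int] = None     # number of jobs at the time, if known


def total_time_lost(events):
    """Sum the time lost over a collection of events."""
    return sum((event.time_lost for event in events), timedelta(0))


events = [
    ReliabilityEvent(datetime(1984, 7, 10, 12, 55), "hard failure",
                     "system A", "tape drive", timedelta(minutes=20), 42),
    ReliabilityEvent(datetime(1984, 7, 10, 13, 30), "soft failure",
                     "system A", "disk"),
]
print(total_time_lost(events))   # 0:20:00
```

Because optional fields default to empty values, a record need not contain every element, consistent with the note above.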

A  continuous  record  of  system  performance  and  activity  provides 
the  system  planner  with  historical  data  for  evaluating  system 
reliability.  This  information  will  enable  the  planner  to  base  future 
acquisition,  current  operation  procedures,  and  maintenance  decisions 
on  past  system  reliability  and  performance. 

The responsibility for recording and reporting the system
reliability  information  should  be  clearly  delineated.  The  recording 
and  reporting  procedure  should  be  reviewed  periodically  for 
duplication  and/or  missing  elements.  It  is  suggested  that  records  be 
maintained  for  at  least  six  months.  Actual  time  frames  for 
maintaining  these  records  should  be  determined  by  the  system  planner 
based  on  the  system's  reliability  and  performance,  as  well  as  on  the 
usage  of  the  records  within  the  organization  (e.g.  to  take 
contractual  action  against  a  vendor). 


3.1.1     Accounting  Logs  - 

Accounting  logs  provide  performance  information  along  with  billing 
information.  Accounting  logs  usually  contain  data  about  individual 
programs  as  well  as  system  usage  [BOUH79].  The  type  of  data  and  depth 
of  detail  can  vary  among  computer  systems.  A  few  examples  of  possible 
data  elements  are  listed  below: 

o  program data: initiation and termination time, total service time
for  each  used  resource  (e.g.  CPU,  disk),  memory  used,  I/O  counts, 
and  user  identification, 

o  system  data:  system  configuration,  software  parameters, 
checkpoint  records,  and  device  errors. 

Analysis  programs  use  accounting  logs  to  produce  system 
performance  reports.  These  reports  enable  the  system  planner  to 
recognize  deviations  from  normal  system  usage  and  to  evaluate  the 
impact  of  a  failure  by  observing  and  contrasting  the  system 
performance  prior  to,  and  subsequent  to,  a  failure.  Figure  1  is  an 
example  of  one  type  of  report  that  can  be  generated. 
Device Type:  Magnetic Tape
Unit-Serial    Model    Vendor    Date
Tape 05-189    3420               Jul 10


NUMBER AND TYPE OF FAILURES

                      This month        Previous Performance
                      to date           Prior 5 Days         Prior month
             Today    high   total      -1   -2   -3   -4    high   total

Hard Fails      3       3      13        0    1    0    0       3      18
Soft Fails     37      29     203        6   15    0   13     819    3505

-  DAILY THRESHOLD LEVELS EXCEEDED:    Hard = 0    Soft = 0

-  DAILY FAILURE LOG:

Unit-serial    Jobname   Volser  MM/DD  HH.MM-HH.MM  Cpu  Failure  Record#  Density

Tape -05-1189  LMSPUOTO  000000  07/10  12.55-12.55  189  Eqpt-rd      003     6250
               00000000  MITSTP  07/10  11.42-11.42  189  Eqir-27     7098     1600
               Landunpe  010758  07/10  00.13-00.13  -89  Data-wr     3914     6250

NOTE:  -  Used to identify devices exceeding threshold values.
          Includes device hard failure log for total picture


(a)  DAILY DEVICE FAILURE REPORT


                                    FAILURE TYPE      USAGE DATA   RATIOS
                                    #Hard   #Soft                  Use/       Use/
Device type     Model               fails   fails     Total        Hardfail   Softfail

Disk storage    332   this month       4     147       6400         1600.0      40.0
                      prior month     13     273       6200          407.6      22.8

Disk storage    335   this mo.         2      88       4104         2052.0      46.7
                      prior mo.        4     321       3625          906.2      11.3

Magnetic tape   189   this mo.        18    3505       4526          251.4       1.2
                      prior mo.       13    1597       5616          432.0       3.5

(b)  MONTHLY DEVICE SUMMARY


Figure 1:  System performance reports - Examples of one type of
accounting information analysis
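
The ratio columns of the monthly device summary appear to be total use divided by the number of failures. The sketch below illustrates that reading with the prior-month magnetic tape row (13 hard failures, 1597 soft failures, total use 5616); the function name is hypothetical, and the rounding to one decimal place is an assumption about how the report was prepared.

```python
def use_per_failure(total_use, failures):
    """Usage units per failure, or None when no failures were recorded."""
    if failures == 0:
        return None
    return round(total_use / failures, 1)


# Prior-month magnetic tape row of Figure 1(b).
print(use_per_failure(5616, 13))     # 432.0
print(use_per_failure(5616, 1597))   # 3.5
```

A falling ratio from one month to the next indicates that the device is failing more often for the same amount of use.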
3.1.2    System  Incident  Reports  (SIR)  -

System  incident  reports  are  generated  by  the  operations  staff  whenever 
a  problem  occurs.  The  SIR  calls  for  full  information  including  time 
of  day,  system  status,  tasks  and  jobs  in  the  system,  possible  cause 
(relating  it  to  the  hardware,  software  or  unknown),  diagnostic 
messages,  availability  of  core,  etc.  (figure  2).  The  final 
disposition  of  the  incident  and  routing  information  (if  any)  is  also 
provided. The completed SIR and any supporting documentation are then
circulated to appropriate technical personnel [FIPS31].


3.1.3     Console  Operator  Logs  - 

Console  operator  logs  are  an  operator  maintained  account  of  the 
system's  daily  activity  (figure  3).  Typical  information  recorded  in 
these  logs  includes:  operator  actions  (e.g.  boots,  mounts,  backups), 
system  configuration,  outages,  crashes,  downtime,  malfunction  of 
peripherals,  and  routine  and  corrective  maintenance  repairs.  The 
system  planner,  operator,  and  maintenance  personnel  can  use  the 
information  contained  in  the  operator  logs  to  analyze  daily  activity, 
identify  problem  areas,  track  reliability  control  procedures,  and 
evaluate  reliability  metrics.  Summary  reports,  such  as  weekly  log 
reports  (figure  4),  efficiency  reports  (figure  5),  utilization 
statistics  (figure  6),  and  failure  categorization  reports  (figure  7) 
can  be  derived  from  the  logs  and  used  to  evaluate  the  system. 


3.1.4     System  Error  Messages  - 

System  error  messages  are  automatically  generated  by  the  system  and 
often  provide  clues  to  the  source  of  an  error.  Relevant  information 
pertaining  to  the  error(s)  is  recorded.  Such  information  may  include: 
time  of  day,  error  type,  control  limits  exceeded  (exception  reports), 
consistency  checks,  timers,  and  selected  traces  and  dumps.  Many 
systems  automatically  log  the  error  messages  and  related  data. 
Analysis programs, available from system vendors or other commercial
sources, are used to extract the relevant reliability information.


3.1.5    Diagnostic  Routines  - 

Diagnostic  routines  provide  information  on  the  integrity  of  the  system 
by  identifying  failures  or  indicating  (by  the  absence  of  failures) 
that  the  system  is  operating  correctly.  The  routines  can  be  run 
periodically  or  subsequent  to  the  occurrence  of  a  problem.  The 
routines  can  provide  information  about  the  problem  type  and  location. 
SYSTEM INCIDENT REPORT


Date ________   Time ________   Down ________   CPU ________

Unit ________

Reference ________   System Avail ________   System Return ________

Subassembly ________   Module ________


Description of Problem


Corrective Action


Diagnostic Routine ________       Service Person ________

Diag. Fail:  Yes    No


Figure 2:  System Incident Report


%OPCOM, 29-Jun-1984 12:28:28.72, message from user NETACP
NET shutting down

%OPCOM, 29-Jun-1984 12:33:54.07, operator disabled
%OPCOM, 29-Jun-1984 12:35:15.47, operator enabled
%OPCOM, 29-Jun-1984 12:45:05.22, operator status
PRINTER, TAPES, DISKS, DEVICES

%OPCOM, 29-Jun-1984 12:48:53.21, request from user PUBLIC
Please mount volume KLAT in device MTA0:

%OPCOM, 29-Jun-1984 12:50:02.11, request satisfied

%OPCOM, 29-Jun-1984 12:50:03.54, message from user SYSTEM
Volume KLAT mounted, on physical device MTA0:

%OPCOM, 29-Jun-1984 13:01:26.91, device LPO is offline

%OPCOM, 29-Jun-1984 13:31:15.63, request from user PUBLIC
Mount new relative volume 2 () on MTA:

%OPCOM, 29-Jun-1984 13:33:45.05, message from user SYSTEM
MTA: in use, try later, mount aborted

%OPCOM, 29-Jun-1984 13:46:21.67, message from user SYSTEM
Current system parameters modified by process ID 001F003C

%OPCOM, 29-Jun-1984 13:46:21.97, device DSK4 is offline
Problems with DSK04
Problems with DSK04

%OPCOM, 29-Jun-1984 13:47:01.30, message from user SYSTEM
DSK04 has been remounted - back online


Figure 3:  Sample operator log
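

Log entries of this kind lend themselves to automated extraction. The sketch below is illustrative only: it assumes each entry begins with `%OPCOM,` followed by a date, a time, and the message text, as in the sample log, and the function name and pattern are hypothetical. Logs from other systems will need a different pattern.

```python
import re

# Assumed layout: "%OPCOM, <date> <time>, device <name> is offline"
OFFLINE = re.compile(r"^%OPCOM, (\S+ [\d:.]+), device (\S+) is offline")


def offline_events(lines):
    """Return (timestamp, device) pairs for 'device ... is offline' entries."""
    events = []
    for line in lines:
        match = OFFLINE.match(line.strip())
        if match:
            events.append((match.group(1), match.group(2)))
    return events


sample = [
    "%OPCOM, 29-Jun-1984 13:01:26.91, device LPO is offline",
    "%OPCOM, 29-Jun-1984 13:33:45.05, message from user SYSTEM",
    "%OPCOM, 29-Jun-1984 13:46:21.97, device DSK4 is offline",
]
for timestamp, device in offline_events(sample):
    print(timestamp, device)
```

The extracted pairs could then feed the summary reports described in section 3.1.3.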


SYSTEM LOG REPORT FOR THE WEEK ENDING July 9


1.  System Utilization for the week

    Time sharing with operator coverage        128:18
    Time sharing without operator coverage      16:35
    Regular field service PM's                  13:55
    Extra field service                          4:05
    Computer operation - stand alone             3:09
    Lost time                                    1:58

    TOTAL HOURS                                168:00


2.  Equipment

    Hardware problems contributing to system downtime:
      MF10 - down, memory parity errors
      DF10-TM10 - problems occurred when using MTA drive

    Hardware problems not causing system downtime:
      LPA1 - replaced hammer module col. 35
      TU56 - tightened hub

3.  Reruns and Lost time

    -246-
    Estimated lost time on system 246 was 20 hours and 25 min.
    -541-
    Estimated lost time on system 541 was 7 hours and 45 min.

4.  Monitor problems

    Name      Date    Problem   Detail

    RF669N    5Jul    Hung      One job running, most other jobs swapped
                                in Run state
    RF6B9N    5Jul    Loop      On PI 4 interrupt chain
    RF6B9N    7Jul    Loop      Dubious crash data
    RF6B9N    8Jul    Hung      Most jobs waiting for disk monitor buffer
                                or disk I/O wait

Job Distribution (average number of jobs)

           0700-0900   0900-1300   1300-1700   1700-1900   1900-2400

    7/4      22.70       48.05       50.20       40.00       32.00
    7/5      30.90       53.89       54.95       40.70       34.43
    7/6      27.15       47.50       55.50       39.40       34.50
    7/7      30.15       44.15       49.90       32.50       19.14
    7/8      31.80       49.89       46.12       22.90       23.20

Figure 4:  Weekly log report
WEEKLY EFFICIENCY REPORT (July)


         System   Performance during scheduled hours        Scheduled  Actual    Efficiency/wk
         turned   down due   down due          total num.   time       system    (% uptime)
         off      to         to                of hours     (hours)    up time
Date              software   hardware   other  down                    (hours)

1-3        7        1:51       1:03      :40     3:34         43:30      39:56     92%

5-10       7        1:48        :43              2:31         95:30      92:59     97.3%
12-17      8         :48       8:53      :57    10:38         95:30      84:52     88.8%
19-24      3         :27       6:25              6:52         95:30      88:38     92.8%

26-31      6         :30        :29     2:52     3:51         98:30      94:39     94.6%

total     31        5:24      17:33     4:29    27:26        428:30     401:04     94%

system operational 6 days/week, 24 hours/day


Figure 5:  Efficiency report
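

The arithmetic behind a row of the efficiency report can be sketched as follows. The sketch assumes that actual up time is scheduled time minus total down time and that efficiency is up time as a percentage of scheduled time; the function names are hypothetical. The sample values are those of the July 19-24 row (scheduled 95:30, down :27 for software and 6:25 for hardware).

```python
def to_minutes(hhmm):
    """Convert 'H:MM' or ':MM' to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours or 0) * 60 + int(minutes)


def to_hhmm(total_minutes):
    """Convert minutes back to 'H:MM'."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


def weekly_efficiency(scheduled, down_times):
    """Return (total down, actual up time, percent uptime) for one week."""
    scheduled_min = to_minutes(scheduled)
    down_min = sum(to_minutes(t) for t in down_times)
    up_min = scheduled_min - down_min
    percent = round(100.0 * up_min / scheduled_min, 1)
    return to_hhmm(down_min), to_hhmm(up_min), percent


print(weekly_efficiency("95:30", [":27", "6:25"]))   # ('6:52', '88:38', 92.8)
```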
SYSTEM UTILIZATION STATISTICS (July)


Reloading System for Processing

                                              Hours     Percent of Total

Down - System not operational due to
       hardware or software failure            22:57          4.2%

Site - Down due to electrical, air
       conditioning, water damage, etc.         3:31           .6%

P.M. - Scheduled preventive
       maintenance                             20:00          3.6%

Unscheduled maintenance                         0:00            0%

Off - System shut off                          31:00

Idle - Work to be processed, but no
       one available to process it             30:00          5.4%

Development - System up, but not for
       "public use"                            42:34          7.7%

Public use - System operational for
       all users                              401:04         72.8%


Figure 6:  Utilization statistics


NUMBER OF FAILURES BY CAUSE (July)


                   Total Number    % of Total

Hardware               212             44
Software               106             22
Application             10              2
Operations              29              6
Environmental           77             15
Unknown                 10              2
Reconfiguration         39              8

                       483            100

Figure 7:  Failure categorization
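

A failure categorization of this kind can be derived by counting logged failures by cause and ranking the causes. The sketch below uses the counts from Figure 7; the function name is hypothetical, and rounding to the nearest whole percent is an assumption about how the report was prepared.

```python
def percent_by_cause(counts):
    """Rank causes by failure count and attach each cause's share of the total."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(cause, n, round(100 * n / total)) for cause, n in ranked]


july = {
    "Hardware": 212, "Software": 106, "Application": 10, "Operations": 29,
    "Environmental": 77, "Unknown": 10, "Reconfiguration": 39,
}
for cause, number, percent in percent_by_cause(july)[:2]:
    print(cause, number, percent)
```

Ranking the causes in this way points the system planner to the categories where corrective effort is likely to have the greatest effect.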
3.1.6     Hardware  And  Software  Monitors  -

Hardware monitors are electronic devices that are physically attached
to the computer system; software monitors are software programs
residing in and utilizing some or all of the host's resources. Both
types  of  monitors  make  measurements  on  the  system  by  recording, 
analyzing  and/or  presenting  data  under  real  time  operation.  The 
performance  level  of  system  resources  as  well  as  any  problems  that 
might  occur  can  be  pinpointed  and  tracked  with  respect  to  their  cause 
and  effect  within  the  system.  For  example,  in  the  data  communications 
area,  measurements  of  response  time  and  communication  line  utilization 
can be used to identify and locate potential problem areas [JACO81].


3.1.7     User's  Level  Of  Satisfaction  - 

The users' level of satisfaction with the system's performance can provide
an  indication  of  the  system  reliability.  User  complaints  and 
questions  can  aid  the  system  planner  (analyst,  operator,  etc.)  in  the 
identification  of  problem  areas.  Interaction  with  users  may  be  a 
formal  or  informal  procedure.  Interaction  may  include:  joint  system 
staff/user  meetings,  surveys  of  the  user  community  (e.g.  ask  about 
possible  problem  areas),  or  user  requests  for  refund  of  purchased 
computer  services  (an  indication  of  possible  system  problem  areas). 


3.1.8     System  Performance  Meetings  - 

System  performance  meetings  provide  the  opportunity  for  appropriate 
personnel  (system  managers,  operators,  analysts,  technicians)  to  meet, 
discuss,  and  analyze  the  system  performance.  The  reports  and 
information  obtained  from  the  above  sources,  as  well  as  any  additional 
data,  form  the  basis  of  the  system  reviews.  The  members  of  the 
meeting  try  to  identify  the  system  components  which  fail  most 
frequently,  the  cause  of  the  failures,  and  solutions  to  minimize  or 
eliminate  future  occurrences  of  such  events. 


3.2     Reliability  Metrics 

Reliability  metrics  provide  a  quantitative  basis  for  the 
assessment  of  the  computer  system  reliability.  The  actual  measurement 
is  accomplished  by  applying  data  gathered  about  the  system  as  input  to 
the  reliability  metrics.  The  data  can  be  obtained  from  the  system 
planner's  in-depth  knowledge  of  the  system's  capacity  and  activity  and 
formal  inspection  of  the  system  components,  as  well  as  the  sources 
cited  in  section  3.1. 

Numerous  quantitative  methods  exist  to  measure  the  reliability  of 
the  computer  system.  Most  metrics  for  system  reliability  are  derived 
from a combination of hardware and software measurements [BESH83,
CAST81, HERN83, SRIV83, THOM83]. Due to the complexity of computer
systems, a variety of reliability metrics should be chosen to describe
the system adequately. The development and/or identification of
appropriate metrics is not an easy task. Often it is necessary for a
reliability expert to identify a set of metrics and/or to develop
mathematical models (algorithms) to describe the system in terms of
probabilities.

A  quantitative  value  or  threshold  level  consisting  of  either  a 
number,  range,  or  percent  should  be  established  for  each  measurement. 
These  values/levels  can  be  established  in  accordance  with: 

o  Department of Defense standards [MIL217, MIL757, MIL781]

o  comparison  to  similar  systems 

o  system  specifications  by  vendors  and/or  GSA  schedule 

o  specific  application  requirements 

Comparisons  of  these  pre-stated  values  with  the  actual  derived 
measurement  values  will  be  helpful  in  assessing  the  reliability  of  the 
computer  system.  Note,  it  is  not  always  possible  to  establish  a 
mathematical  value  for  all  measurements.  In  these  cases,  it  is 
advisable  to  develop  a  relative  importance  rating  (priority  factor)  to 
indicate  its  value  [PERS83]. 

The  remainder  of  this  section  presents  an  overview  of  system 
reliability  metrics.  The  discussion  will  be  divided  into  three 
categories:  hardware,  software,  and  human  measurements.  The 
objective  is  to  identify  the  basic  concepts  and  underlying  attributes 
associated with the metrics of the various categories. Detailed
analysis of specific metrics is beyond the scope of this document,
but additional references are given for each category.


3.2.1     Hardware  Measurements  - 

There  has  been  an  abundance  of  information  written  on  hardware 
reliability  metrics.  It  is  these  metrics  that  are  the  most  familiar 
and  are  thought  of  as  the  'traditional'  measurements.  The  metrics  are 
used  to  assess  the  mechanical  or  electrical  elements  of  the  computer 
system  and  have  been  used  as  the  original  tools  for  the  evaluation  of 
total  computer  system  reliability. 

The  hardware  metrics  are  a  means  of  evaluating  the  amount  of 
processing  time  lost  due  to  the  failure  of  the  computer  system  or  a 
specific  component.  The  calculation  of  the  reliability  measurement 
will vary with the complexity of the system configuration. Although
the basic concepts will remain the same, the hardware reliability
measurements for a single, non-redundant, non-repairable system will
differ from those of redundant, repairable, and/or distributed system
configurations. Metrics for the latter must compensate for the
special properties (e.g., replication of components) of the system.

There are two approaches to estimating hardware reliability; one
is  based  on  statistical  probability  distribution  models,  and  the  other 
is  based  on  actual  system  performance.  The  probability  model  is  the 
analytical  basis  for  making  reliability  predictions.  The 
determination  of  an  appropriate  model  is  necessary  to  achieve 
realistic  predictions,  and  should  be  developed  by  an  expert. 
Prediction tables [MIL217, MIL757, MIL781] and other literature
sources  [BEAU79,  LAWL82,  SIEW82]  can  provide  background  and  general 
reliability  models. 

Quantitative  metrics  based  on  an  operational  system  can  provide 
information  on  the  processing  time  lost  due  to  the  failure  of  the 
computer system or a specific component. Among the measurements of
interest are: the number of times the hardware ceases to function in
a given time period (failure rate), the average length of time the
hardware is functional (mean time between failures, MTBF), the amount
of time it takes to resume normal operation (mean time to repair,
MTTR), and the quantity of service (availability). Although a
simplistic model, Figure 8 depicts some of these measurements.
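
As a minimal sketch of how these operational measurements might be
computed, the Python below assumes each failure is logged as a
(time failed, time restored) pair, in hours, over a fixed observation
period. The function name, the record layout, and the choice of
uptime divided by total time as the availability figure are
assumptions made for illustration; they are not definitions taken
from this report.

```python
def hardware_metrics(outages, period_hours):
    """Summarize the failures observed during one period.

    outages: list of (failed_at, restored_at) pairs, in hours from the
             start of the period, assumed not to overlap
             (hypothetical record layout).
    """
    failures = len(outages)
    downtime = sum(restored - failed for failed, restored in outages)
    uptime = period_hours - downtime
    return {
        # failures per hour of the observation period
        "failure_rate": failures / period_hours,
        # operable time divided by the number of failures
        "mtbf": uptime / failures if failures else None,
        # repair time divided by the number of repairs
        "mttr": downtime / failures if failures else None,
        # fraction of the period in which the system was usable
        "availability": uptime / period_hours,
    }


# Three outages (4, 2, and 6 hours long) in a 720-hour (30 day) period.
metrics = hardware_metrics(
    [(100.0, 104.0), (350.0, 352.0), (600.0, 606.0)], 720.0
)
```

For this illustrative log the system is operable for 708 of the 720
hours, so the sketch reports an MTBF of 236 hours and an MTTR of 4
hours.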

Other  measurement  algorithms  and  analysis  techniques  might 
include  calculations  to  determine: 


o  a level of confidence in the system's ability to survive a
failure,

o  the number of instructions that could be processed before a
failure,

o  the amount of time the system will be inoperable,

o  the response time delay.

Comprehensive descriptions of hardware metrics and analysis techniques
can be found in [BEAU79], [LAWL82], [0DA81], and [SIEW82].


3.2.2     Software  Measurements  - 

There  is  a  tendency  to  use  hardware  metrics  to  evaluate  the 
software  component  of  the  computer  system.  Although  use  of  these 
metrics  may  be  appropriate  in  a  few  cases,  it  can  limit  the  scope  of 
the  software  evaluation.  This  limitation  is  due  to  differences 
between  hardware  and  software  failure  origination  and  repeatability. 
For example, hardware failures are either transient or repeatable and
result from design, development, or component faults, whereas
software failures are almost always repeatable and originate in design
and development [ONEI83].


[Figure 8 time line: periods of system operation separated by
failures and repairs]

MTTF* = the number of time units the system is operable before
        the first failure occurs

MTBF  = (sum of the number of time units the system is operable)
        / (number of failures during the time period)

MTTR  = (sum of the number of time units required to perform system repair)
        / (number of repairs during the time period)

*MTTF applies to non-repairable systems and is not applicable after
the first failure. (Some experts consider MTTF to be a special case
of the MTBF measure.)


Figure 8: Measures of MTTF, MTBF, MTTR and availability

The time line illustrates the various measurements with
respect to the recognition and repair of failure occurrences.

Software reliability calculations can be performed throughout the
system  life  cycle  to  quantify  the  expected  or  the  actual  reliability. 
In  the  early  phases  of  development,  the  measurements  can  be  applied  to 
the  documentation  on  the  system  concepts  and  design;  and  in  later 
phases,  to  both  the  documentation  and  the  source  code.  The  metrics 
should  measure  errors  caused  by  deficiencies  or  the  inclusion  of 
extraneous  functions  in  the  system  design  specification, 
documentation,  or  source  code.  The  metrics  should  be  limited  to 
errors  caused  by  software  deviating  from  its  specifications  while  the 
hardware is functioning correctly. Any error that occurs several
times before detection and correction should be charged as a single
error in the reliability calculations.

Software  measurements  are  used  to  predict  or  quantify  the 
software  quality  of  the  system.  The  measurement  can  be  calculated  by 
either  the  evaluation  of  past  success  or  the  prediction  of  future 
failure.  One  method  of  calculating  software  reliability  might  be  to 
count  the  number  of  errors  that  occur  in  the  source  code  [PRES83], 
e.g. 

Rating = (Number of Errors) / (Total Number of Lines of Executable Code)


(Rating is in terms of the expected or actual number
of errors that occur in a specified time interval.)
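
The calculation can be sketched in a few lines of Python. The sketch
assumes that each observed error occurrence is logged with an
identifier, so that an error seen several times before correction is
charged only once, as described earlier in this section. The function
name and the identifier format are hypothetical.

```python
def error_rating(error_reports, executable_lines):
    """Distinct errors per line of executable code.

    error_reports: one identifier per observed occurrence; repeats of
                   the same identifier are counted as a single error.
    """
    if executable_lines <= 0:
        raise ValueError("executable_lines must be positive")
    distinct_errors = set(error_reports)
    return len(distinct_errors) / executable_lines


# Five occurrences, but only three distinct errors, in 1500 lines.
reports = ["E-17", "E-17", "E-42", "E-03", "E-42"]
rating = error_rating(reports, 1500)
```

Here the rating is 3 errors in 1500 lines, or 0.002 errors per line of
executable code.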

Other  measurement  algorithms  or  analysis  techniques  might  include 
calculations  to: 

o  define  the  levels  of  error  occurrence  and  tolerance, 

o  determine the percentage of errors during a time interval,

o  identify  error  types  and  the  modules  in  which  they  occur, 

o  identify  deficiencies  in  the  documentation  or  code. 

Comprehensive descriptions of software metrics and analysis techniques
can be found in [ONEI83] and [PRES83].


3.2.3     Human  Measurements  - 

Human  reliability  and  its  influence  on  the  computer  system  is  a 
developing  discipline.  Although  new  models  are  being  developed,  there 
has  been  a  shortage  of  human  reliability  metrics,  a  general  lack  of 
understanding  of  the  analysis  required,  and  an  absence  of  pertinent 
reliability data [LASA83].

Human reliability metrics differ from those for hardware or
software. Differences stem from the ability of the human to make
decisions,  to  learn  from  one's  experiences,  and  to  continue 
functioning  in  spite  of  a  mistake  ('failure')  [SRIV83].  Thus,  the 
metrics  need  to  model  a  human's  ability  to  work  under  different 
situations.  A  recognized  approach  [SRIV83]  is  to  divide  the  metrics 
into  two  fundamental  components: 

o  the 'performance' element - a task is completed, with no decision
required, and

o  the 'control' element - a task consists of several parameters and
requires a decision to be made.

The  failures  that  should  be  assessed  by  the  metrics  include: 
incorrect  diagnosis,  misinterpretation  of  instructions,  inadequate 
support  or  environmental  conditions,  and  insufficient  attention  or 
caution.  Algorithms  and  analysis  techniques  might  include 
calculations  to: 

o  determine improper human (operator, user, etc.) performance,

o  determine the amount of downtime caused by human/machine
interaction,

o  identify the number of errors that were manually detected and
corrected,

o  identify the errors caused by human alteration to the system,

o  evaluate human/machine interaction - amount required, cost to
implement, and time to accomplish.

Comprehensive descriptions of human reliability metrics can be found
in [LASA83] and [SRIV83].


3.3     Assessing  The  Quality  Of  The  Computer  System 

The  analyses  of  the  information  obtained  from  the  reliability 
metrics  (in  the  previous  section)  can  be  used  individually,  in 
combination, or in conjunction with historical system data to evaluate,
estimate, or predict the reliability of the computer system.
Specifically, the system planner can utilize this derived information
to:

o  establish performance and acceptance criteria,

o  formulate policies to reflect or achieve reliability goals,

o  gather information on the effect of a failure on the system and
organization [CAST81], and

o  clarify and identify specific failures or problems.

Examples of assessments that can be made about the system are given in
the following paragraphs. A chart listing these examples and the
measurement class that might be associated with them is given by
Figure 9.


3.3.1     Performance/Acceptance  Criteria  -

The following criteria can be used in the specification and evaluation
of performance levels in contracts with vendors and/or in-house system
policies.

Threshold  Value  Assessment  is  the  comparison  of  pre-established  metric 
values  with  the  actual  derived  value  of  the  metric.  The  technique  is 
used  to  indicate  if  the  measurement  exceeds,  meets,  or  falls  short  of 
expected  levels  of  performance.  It  is  a  method  that  can  be  applied  to 
all  measurement  types  and  provides  a  means  of  specifying  the  minimum 
performance  level  that  is  to  be  achieved. 

General  System  Availability  is  the  amount  of  time  the  overall  computer 
system  is  operational  and  usable.  The  achievement  of  a  predetermined 
availability  threshold  can  be  used  to  indicate  acceptable, 
substandard, or unacceptable performance levels. A chart should be
developed to indicate the limits for acceptable, substandard, and
unacceptable performance of the system with values based on
availability requirements. For existing systems, the derived metric
values should be compared against the required levels listed in the
chart. The chart below is a hardware oriented example of system
availability performance limits (using hours of downtime in a computer
system).

                    HOURS OF SYSTEM DOWNTIME

Subsystem causing    Acceptable    Substandard    Unacceptable
downtime

CPU + Memory         0-16.9        17.0-33.9      >34.0
Disk Storage         0-16.9        17.0-33.9      >34.0
Magnetic Tape        0-8.4         8.5-16.9       >17.0
Printer              0-8.4         8.5-16.9       >17.0


(Note: The ranges listed are for illustration purposes and are not
meant to be recommended values for any particular system).
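
A threshold value assessment of this kind can be sketched as a simple
table lookup. The Python below uses the illustrative downtime limits
for three of the subsystems in the chart; the dictionary layout and
the function name are assumptions for illustration only, and the
limits are not recommended values.

```python
# Illustrative limits, in hours of downtime:
# (highest acceptable value, highest substandard value).
DOWNTIME_LIMITS = {
    "CPU + Memory": (16.9, 33.9),
    "Magnetic Tape": (8.4, 16.9),
    "Printer": (8.4, 16.9),
}


def assess_downtime(subsystem, hours):
    """Compare a measured downtime against the pre-established limits."""
    acceptable_max, substandard_max = DOWNTIME_LIMITS[subsystem]
    if hours <= acceptable_max:
        return "acceptable"
    if hours <= substandard_max:
        return "substandard"
    return "unacceptable"


# Measured downtime for one reporting period (hypothetical figures).
measured = {"CPU + Memory": 12.0, "Magnetic Tape": 10.0, "Printer": 20.0}
report = {name: assess_downtime(name, hours)
          for name, hours in measured.items()}
```

The same lookup could serve any metric for which acceptable and
substandard limits have been agreed in advance.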

General  System  Stability  is  the  average  amount  of  time  the  system  is 
operational  before  user  services  are  interrupted,  loss  of  work 
results,  or  a  system  reboot  is  required.  The  determination  of  system 
stability can be derived from the number of system interruptions (e.g.,
measurements that indicate the number and length of time the system is
unavailable for use). A malfunction or failure that does not result
in system interruption is ignored for system stability determination.
A chart should be developed to indicate the limits of performance of
the system with values based on stability requirements. For existing
systems, the derived metric values should be compared against the
required levels listed in the chart. For example, the chart below
illustrates acceptable, substandard, and unacceptable performance
levels for several subsystems during a 30 day period.

                 NUMBER OF SYSTEM INTERRUPTIONS

Subsystem causing     Acceptable    Substandard    Unacceptable
downtime

CPU + Memory          0-9           10-19          >20
Disk Storage          0-9           10-19          >20
Magnetic Tape         0-4           5-9            >10
Software Module 1     0-12          13-24          >25
Software Module 2     0-12          13-24          >25


(Note:  The  ranges  listed  are  for  illustration  purposes  and  are  not 
meant  to  be  recommended  values  for  any  particular  system). 


General  Survivability  is  the  probability  that  the  system  will  continue 
to  perform  after  a  portion  of  the  system  becomes  inoperable.  A 
numerical  value  or  importance  level  should  be  established  and  used  to 
indicate  the  acceptable  and/or  required  levels  of  survivability. 
Survivability  can  be  derived  from  measurements  that  relate  to  the 
number  of  failures  (both  hard  and  soft  failures)  and  the  ability  of 
the  system  to  recover  from  the  failure.  In  addition,  measurements 
that  indicate  the  system  usage  and  the  amount  of  damage  that  could 
result  from  a  failure  can  influence  the  survivability  rating  and 
should  also  be  considered.  The  following  list  is  an  example  of  levels 
of  importance  associated  with  various  types  (hardware,  software,  etc.) 
of  subsystems.  (Note:  documentation  survivability  encompasses  the 
scope,  clarity,  completeness,  and  correctness  of  the  documentation 
that  will  enable  a  user  to  read,  understand,  and  perform  the  activity 
described  correctly). 


                          IMPORTANCE LEVELS

Subsystem            Level of Importance   Comments

CPU                  high                  Level depends on the functional
Tape Drive 1         low                   importance and usage of the
                                           device

Software Module 1    high                  Level depends on the functional
Software Module 2    moderate              importance and usage of the
                                           module

Documentation        moderate              Level depends on the subject
                                           importance and the usage of
                                           the documentation


(Note:  The  levels  listed  are  for  illustration  purposes  and  are  not  meant 
to  be  recommended  values  for  any  particular  system). 


3.3.2     Policy  Formulation  -

The values obtained from the reliability measurements are used in
the formulation and adjustment of reliability policies.

Maintenance  Policies  and  Procedures  should  be  examined  and  evaluated 
to  reflect  the  reliability  requirements  of  the  computer  system.  The 
system  planner  can  use  the  metrics  to  assess  the  effectiveness  of  the 
current  maintenance  policies  and  to  adjust  them  accordingly.  Almost 
all  the  reliability  measurements  can  be  used  to  indicate  system 
problems  and  are  helpful  tools  in  the  identification  of  potential  and 
actual subsystem failures. In addition, the logistic delay (the delay
encountered while waiting for parts and/or service personnel) should be
included in the considerations.

Acquisition  Policies  should  be  examined  and  evaluated  to  reflect  the 
reliability  requirements  of  the  computer  system.  The  reliability 
measurements  are  indications  of  the  system  activity  and  quality,  and 
can  be  used  as  supporting  factors  in  the  justification  and 
specification  of  new  system  acquisitions  and/or  system 
reconfiguration. 


3.3.3     Information  On  The  Impact  Of  A  Failure  - 

The  more  dependent  an  organization  is  on  the  computer  system,  the 
greater  the  impact  a  failure  would  have  on  that  organization. 
Reliability  measurements  that  provide  information  about  the  frequency 
and  identity  of  system  failures  and  the  performance  level  of  the 
computer  system  are  used  in  the  assessment  of  the  impact. 

Productivity  and  Workload  Scheduling  is  the  scheduling  of  the  amount 
of  work  that  consumes  computer  resources.  A  system  not  functioning  to 
its  full  capacity  may  delay  or  prevent  the  processing  of  user  and 
system  jobs.  This  interruption  can  affect  productivity  and  product 
schedules  and  as  such,  translates  into  a  cost.  With  knowledge  of  the 
computer system's reliability, a system planner can adjust and forecast
current  and  future  workload  requirements  accordingly. 

Amount  of  'Backup'  Work  is  the  amount  of  work  performed  in 
anticipation  of  a  failure.  This  would  include  multiple  copies  of 
system  and  user  programs  and  data,  checkpoints  for  easy  restart,  and 
multiple  runs  of  identical  jobs.  Efforts  such  as  these  are  used  to 
circumvent  the  effects  of  a  failure  or  to  facilitate  recovery.  In 
general,  the  less  reliable  a  computer  system  is,  the  greater  the 
amount  of  'backup'  work  performed. 


3.3.4     Clarification  And  Identification  Of  Failures  -

The  combined  analysis  of  reliability  measurement  results  and 
accumulated  historical  system  data  is  a  means  of  identifying  the 
occurrence  of  specific  failures/problems  or  of  obtaining  early  warning 
indicators  of  potential  failures/problems.  This  knowledge  enables  the 
system  planner  to  take  the  appropriate  corrective  action  in  a  timely 
manner.  Of  particular  value  in  pinpointing  the  cause  of  the 
failure/problem  is  the  correlation  of  measurements  that  pertain  to  the 
type,  location,  and  frequency  of  the  failure/problem  with  the  system's 
resultant action (e.g., crash, recovery - retry or warm start).


4.0     BASIC  TECHNIQUES

Reliability  techniques  are  incorporated  into  a  computer  system  to 
reduce  the  errors  and  effects  resulting  from  the  corruption  of  data  or 
malfunction  of  the  hardware  during  system  operation.  The  techniques 
are  implemented  to  prevent,  offset,  or  correct  the  occurrence  of  one 
of  the  following  fundamental  categories  of  faults. 


1.  Physical faults. The disruption of the information processing
function  due  to  a  hardware  malfunction  of  the  computer  and/or  its 
peripherals  [AVIZ79].  These  failures  occur  due  to  the  weakening 
and  breakage  of  the  components  over  a  period  of  time  and  usage. 

2.  Design  faults.  The  imperfections  in  the  system  due  to  mistakes 
and  deficiencies  during  the  initiation,  planning,  development, 
programming,   or  maintenance  of  the  computer  system  [AVIZ79]. 

3.  Interaction  faults.  The  malfunctions  or  alterations  of  programs 
and  data  caused  by  human/machine  interactions  during  system 
operations. 

The  remainder  of  this  section  presents  a  general  discussion  of 
basic  reliability  techniques.  The  selection  of  techniques  that  are 
applicable  for  a  given  system  will  depend  on  the  system  objectives  and 
configuration,  and  the  feasibility  of  implementing  the  technique.  The 
discussion  is  divided  into  two  parts:  design  features  and 
implementation  techniques.  Design  features  are  the  reliability 
techniques  designed  into  the  hardware  configuration  or  software  source 
code  by  the  system  developer.  Implementation  techniques  are  those  the 
system  planner  can  adopt  to  improve  the  reliability  of  the  system. 


4.1     Design  Features 

A  large  range  of  reliability  techniques  is  available  to  the 
designers  of  computer  systems.  The  goal  of  these  techniques  is  to 
keep  the  system  operational  either  by  eliminating  faults  or  in  spite 
of  the  presence  of  faults.  A  combination  of  reliability-enhancing 
features  may  be  used  within  a  single  system.  The  specific  techniques 
used  may  vary  among  systems  due  to  cost,  performance,  and  reliability 
trade-offs.

Typically, the system planner does not designate which design
techniques are to be incorporated into the computer system. (The
development of custom designed system software may be an exception to
this rule.) Despite this inability, the system planner should be
familiar with reliability design techniques in order to better specify
and understand the reliability capabilities of the system. A list of
techniques is shown in Figure 10. A brief explanation of several of
these follows. More thorough discussion can be found in [CART79],
[DENN76], [McDE80], and [SIEW82].


FUNCTION                   TECHNIQUE

Fault Avoidance            Environment modification
                           Quality components
                           Component integration level
                           Verification and validation

Fault Tolerance

  Fault detection:         Duplication
                           Error detection codes
                             M-of-N codes
                             Parity
                             Checksums
                             Arithmetic codes
                             Cyclic codes
                           Self-checking and fail-safe logic
                           Watch-dog timers and timeouts
                           Consistency and capability checks

  Masking redundancy:      NMR/voting
                           Error correcting codes
                             Hamming SEC/DED
                           Masking logic
                             Interwoven logic
                             Coded-state machines
                           Recovery Block
                           N-version Programming

  Dynamic redundancy:      Reconfigurable duplication
                           Reconfigurable NMR
                           Backup sparing
                           Reconfiguration
                           Recovery


Figure 10: Classification of reliability techniques [SIEW82]


4.1.1     Fault  Avoidance  - 

The  goal  of  a  fault  avoidance  approach  is  to  reduce  or  eliminate 
the  possibility  of  a  fault  through  design  practices  such  as  component 
burn-in,  testing  and  validation  of  hardware  and  software,  and  careful 
signal  path  routing.  The  approach  assumes  an  a  priori  perfectibility 
of  the  system.  To  achieve  fault  avoidance,  all  components  of  the 
system  (hardware  and  software)  must  function  correctly  at  all  times. 

Fault  Avoidance  Techniques: 

o  Environmental  factors.  The  elimination  of  faults  caused  by  heat 
produced  by  the  system's  circuitry. 

o  Quality components. The acquisition and use of extremely reliable
components. 

o  Component  and  system  integration.  The  careful  assembly  and 
interconnection,  and  extensive  testing  and  verification  of 
individual  modules,   subsystems,  and  the  entire  system. 

o  Verification,  validation,  and  testing.  The  process  of  review, 
analysis,  and  testing  employed  throughout  the  software  development 
life cycle to ensure the correctness, completeness, and
consistency of the final product [BRAN81, POWE82].


4.1.2    Fault  Tolerance  - 

The  goal  of  a  fault  tolerance  approach  is  to  preserve  the 
continued  correct  execution  of  functions  after  the  occurrence  of  a 
selected  set  of  faults.  This  is  achieved  through  redundancy:  the 
addition  of  hardware,  software  or  repetition  of  operations  beyond 
those  minimally  required  for  normal  system  operation. 

Fault  Tolerance  Techniques: 


o  Watchdog timers and timeouts. A process must reset a timer or
complete processing within a set time period. Inability to
accomplish this task is an indication of possible failure.
Neither timers nor timeouts can be used to check data for errors.

o  N Module Redundancy (NMR)/voting. The outputs of N identical
modules are compared. By the use of majority voting, a fault can
be detected, the correct output selected, and processing
continued. The most common NMR technique is Triple Modular
Redundancy (TMR) (figure 11).

Figure 11: Triple Modular Redundancy (TMR) System with voting

o  Error correction codes (ECC). The representation of information
by code sequences that will enable the extraction of original
information despite its corruption.

o  Recovery block method. Several independent programs are developed
to perform a specific task. If a fault is detected in one
program, an alternate program is selected to execute the task.

o  N-version Programming. The outputs of N independently coded and
executed programs are compared. By majority voting or a
predetermined strategy, a 'correct' result is identified. Since
the programs are developed independently, it is assumed that the
probability of a common error is close to zero.
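
Two of the techniques above, majority voting (as used in NMR/TMR and
N-version programming) and the recovery block method, can be
illustrated with a short Python sketch. All names below are
hypothetical. The sketch also assumes that a recovery block detects a
fault by applying an acceptance test to each result, which is one
common way of realizing the method and is not prescribed by this
report.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by more than half of the modules."""
    if not outputs:
        raise ValueError("no module outputs to compare")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def recovery_block(alternates, acceptance_test, *args):
    """Run each alternate program in turn until a result is accepted."""
    for program in alternates:
        try:
            result = program(*args)
        except Exception:
            continue  # a crash is treated as a detected fault
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternate programs failed")


# TMR: one of the three modules returns a corrupted value.
voted = majority_vote([42, 42, 17])


# Recovery block: the primary square-root routine is faulty.
def faulty_sqrt(x):
    return -1.0


def backup_sqrt(x):
    return x ** 0.5


root = recovery_block(
    [faulty_sqrt, backup_sqrt],
    lambda r: r >= 0 and abs(r * r - 9.0) < 1e-9,  # acceptance test
    9.0,
)
```

In the example the voter masks the single faulty module and returns
42, while the recovery block rejects the primary routine's result and
falls back to the alternate, which returns 3.0.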


4.2    Implementation  Techniques 

A  variety  of  reliability  related  techniques  can  be  implemented  by 
the system planner. Several of these implementation techniques are
simply variations of design features described above. The
implementation techniques may require adjustments to current
procedures, the system configuration, or management policies. The
following  are  examples  of  several  techniques  a  system  planner  can 
implement  with  the  addition  of  hardware,  or  software,  or  through  the 
management  of  the  facility. 

Implementation  Techniques: 

The  first  four  techniques  are  based  on  principles  of  redundancy. 

o  Duplication of systems. The replication of the computer system,
subsystem, software, or peripherals to provide a replacement
capability should a failure occur. The ability to switch to an
alternative system (subsystem, etc.) enables usage of the system
to continue as repairs (corrections) are made to the failed unit.

o  Environmental backup. The ability to use alternative sources of
power, air conditioning, and communication lines in case an outage
should occur. Battery backup, uninterruptible power supply (UPS),
and frequency interference filters are examples of techniques to
counter environmental interferences.

o  Reconfiguration. The removal or disabling of a faulty module
from the rest of the system. The system continues to function
(without the faulty module) but at a degraded level, e.g., with
limited capacity or capabilities.

o  Software archive. The duplication of software to replace
corrupted or inaccessible data or programs. These redundant
copies of software should be kept current, on alternate storage
media, and available for use should a failure occur. Due to the
possible threat of damage or theft, consideration should be given
to storing the software archive at an alternate location
(off-site).

o  Maintenance  policy.  The  establishment  of  a  preventive  maintenance 
(PM)  program  periodically  to  check  the  system  and  correct 
potential  faults.  PM  is  a  means  of  locating  and  correcting 
problems  before  they  propagate  through  the  system  and  cause  major 
damage.  A  corrective  maintenance  program  should  also  be  provided. 
This  activity  normally  occurs  after  the  system  ceases  to  function 
as  originally  intended.  The  system  is  returned  to  an  error-free 
state. 

o  Personnel.  Support  personnel  (operators,  analysts,  technicians) 
should  be  available  while  the  system  is  operational  and  able  to 
intercede  if  a  problem  occurs.  For  example,  if  an  operator  is 
required  to  boot  a  system,  a  provision  should  be  made  for  having 
an  operator  on  duty  any  time  the  system  is  operational.  Staff 
schedules  should  be  adjusted  in  order  to  curtail  delay  due  to 
human  unavailability.  Proper  training  and  complete  documentation 
are  aids  to  help  personnel  act  quickly  when  a  problem  occurs  and 
prevent  or  minimize  loss  of  information  or  loss  of  the  system 
(i.e.  crash). 

o  Supplies.  Directly  related  to  the  hardware  or  software  of  the 
system,  the  use  of  quality  printer  ribbon,  disks,  tapes,  etc.  can 
eliminate  many  of  the  peripheral-related  failures. 


5.0     RECOVERY  STRATEGY 

The  purpose  of  recovery  is  to  restore  the  system  to  a  correctly 
functioning  state  from  an  erroneous  one.  The  reliability  objectives, 
the  effects  of  a  fault,  and  the  system's  tolerance  of  the  resulting 
errors  must  be  understood  and  considered  in  the  determination  of  a 
recovery  strategy. 


5.1     Recovery  Procedures 

It  is  necessary  for  the  system  planner  to  establish  procedures  to 
recover  from  a  failure  and  restart  the  system  quickly.  While  many  of 
the  error  recovery  procedures  pertain  to  methods  imbedded  in  the 
system  architecture  (both  hardware  and  software),  others  are  a  result 
of  facility  management  practices  or  site  implemented  techniques. 
Imbedded  procedures  are  limited  by  the  vendor  design  and  need  to  be 
specified  during  the  planning  and  acquisition  of  the  system.  Facility 
management  and  site  implemented  techniques  can  be  established  at 
system  initiation  as  well  as  during  the  operational  stage.  Section  4 
gives  details  of  possible  imbedded  (design)  techniques  and 
implementation techniques.

In choosing recovery techniques, the system planner needs to
evaluate the system requirements with respect to:

1.  the amount of time between the occurrence of a failure and the
start of the recovery process,

2.  the amount of time between the initiation of recovery and the
restoration of the system,

3.  the amount of human interaction (maintenance) required to restore
the system. (Manual recovery techniques generally require more
time than do automatic recovery techniques.)
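
The three considerations above can be compared across candidate
techniques with a small calculation. The following is a minimal sketch
only: the technique names, the times, and the limits are hypothetical
values chosen for illustration, not figures from this guide.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTechnique:
    name: str
    detection_delay_min: float  # item 1: failure occurrence to start of recovery
    restoration_min: float      # item 2: start of recovery to restored system
    manual_effort_min: float    # item 3: staff time needed during recovery

    def total_outage_min(self) -> float:
        """Elapsed time from the failure until the system is restored."""
        return self.detection_delay_min + self.restoration_min


def meets_requirements(technique, max_outage_min, max_manual_min):
    """True if the technique fits the (assumed) site limits."""
    return (technique.total_outage_min() <= max_outage_min
            and technique.manual_effort_min <= max_manual_min)


# Hypothetical candidates and limits, for illustration only.
candidates = [
    RecoveryTechnique("automatic restart from checkpoint", 1.0, 4.0, 0.0),
    RecoveryTechnique("operator-initiated reboot", 10.0, 15.0, 20.0),
]
acceptable = [t.name for t in candidates
              if meets_requirements(t, max_outage_min=10.0, max_manual_min=5.0)]
print(acceptable)
```

With these assumed values only the automatic technique satisfies both
limits, which is consistent with the observation that manual recovery
generally requires more time.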


5.2     Recovery  Levels 

The level of computing achieved through recovery procedures can
be grouped into three classes.

o  Full  recovery  returns  the  system  to  the  set  of  conditions  existing 
prior  to  the  failure.  Hardware  and  software  possess  the  same 
computing  capability  as  before.  Typically,  failed  components  are 
replaced  by  spare  equipment  (hardware)  or  duplicate  software 
modules. Data and information are returned to their pre-failure
state.

o  Degraded  recovery  means  the  system  is  returned  to  an  operational 
state,  but  with  a  reduced  computing  capacity.  Malfunctioning 
hardware  and  software,  and  corrupted  data  and  information  are 
identified  and  excluded  from  the  system. 

o  Safe  shutdown  occurs  when  the  system  cannot  maintain  a  minimum 
level  of  computing  capacity.  The  system  is  shut  down  with  as 
little  damage  and  as  much  warning  as  possible.  Diagnostic 
information and warning messages are given. Attempts are made to
reduce the amount of damage to the remaining hardware, software, and
data.
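
The three classes can be expressed as a simple selection rule. This is
a sketch under stated assumptions: the function name, the use of a
capacity fraction, and the idea that an available spare permits full
recovery are illustrative simplifications, not a procedure defined by
this guide.

```python
from enum import Enum


class RecoveryLevel(Enum):
    FULL = "full recovery"
    DEGRADED = "degraded recovery"
    SAFE_SHUTDOWN = "safe shutdown"


def choose_recovery_level(remaining_capacity, minimum_capacity, spare_available):
    """Pick a recovery level after a failure.

    remaining_capacity and minimum_capacity are fractions of the
    pre-failure computing capacity (hypothetical measure).
    """
    if spare_available:
        # Failed components can be replaced, restoring pre-failure conditions.
        return RecoveryLevel.FULL
    if remaining_capacity >= minimum_capacity:
        # Faulty parts are excluded; the system continues at reduced capacity.
        return RecoveryLevel.DEGRADED
    # The minimum level of computing cannot be maintained.
    return RecoveryLevel.SAFE_SHUTDOWN


print(choose_recovery_level(0.6, 0.5, spare_available=False).value)
```

In practice the decision also depends on cost, recovery speed, and the
availability of maintenance staff.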

The  objective  of  these  recovery  levels  is  to  avoid  a  hard,  complete 
crash  of  the  system.  If  full  recovery  cannot  be  achieved,  the 
alternatives  are  to  continue  processing  in  a  degraded  mode  or  to  shut 
down  the  system.  To  determine  the  appropriate  recovery  level,  the 
system  planner  must  answer  the  following  questions: 

1.  System  application  requirements: 

Can the application tolerate a shutdown or graceful degradation?

2.  Extent  of  damage  to  hardware  and  software: 

Can  critical  operations  continue  to  be  processed  despite  damage  to 
system  components? 

3.  Speed  with  which  the  operation  must  be  recovered: 

Is  there  sufficient  time  for  the  recovery  process  to  complete 
without  violating  system  operational  (speed,  safety,  etc.) 
requirements? 

4.  Technical  capability  to  implement  the  recovery  techniques: 

Is it possible to design or implement techniques to identify,
locate, correct, and record a fault in the system or its
components?

5.  Cost to implement the recovery process:

Is the recovery level cost-beneficial?

6.  Amount  of  external  assistance  (manual  intervention): 

How  much  maintenance  is  required  and  will  be  available  for 
recovery  efforts  after  a  failure  occurs? 

All  the  above  questions  should  be  examined  with  respect  to  the  system 
as  a  whole  and  any  critical  and/or  self-contained  subsystems  or 
components.


6.0  THE  RELIABILITY  PROGRAM

6.1  Implementing  A  Reliability  Program 

A  reliability  program  should  be  initiated  with  the  conception  of 
the  system,  continue  through  daily  operation,  and  end  only  when  the 
system  is  retired  from  use.  The  reliability  program  should  be 
incorporated  into  the  system  life  cycle  as  early  as  possible  in  order 
to  maintain  consistency  with  overall  system  objectives,  as  well  as  to 
minimize  the  difficulty  and  cost  of  implementation. 

The  tasks  involved  in  implementing  a  successful  reliability 
program  require  the  participation  of  personnel  from  a  variety  of 
organizations  (e.g.  system  planner,  technical  specialists,  users, 
procuring  personnel,  vendors).  To  ensure  the  success  of  the  program, 
the  system  planner  needs  to  understand  the  reliability  engineering  and 
management  tasks  and  coordinate  the  efforts  of  the  people  required  to 
perform  the  tasks.     The  system  planner  must  be  able  to: 

o  understand  reliability  engineering  terminology 

o  specify  reliability  performance  tasks 

o  schedule  when  the  tasks  are  to  be  performed 

o  identify  personnel  to  perform  the  tasks 

o  understand  the  consequences  of  eliminating  or  curtailing  the  tasks 

o  identify  major  alternatives  with  respect  to  cost  and  risks 

o  locate  additional  information/consultants  if  needed. 


6.2    Financial  Considerations 

There  are  several  fundamental  costs  associated  with  the 
implementation  of  a  reliability  program.  Calculating  the  costs  vs. 
benefits  of  such  a  program  is  not  an  easy  task  [FIPS64].  The  analysis 
should  provide  the  system  planner  with  the  information  needed  to 
evaluate  alternative  approaches  and  to  make  decisions  about 
initiating,  procuring,  continuing,  or  modifying  the  reliability 
program. 

The  system  planner  should  view  the  costs  of  a  reliability  program 
as  an  investment  that  is  amortized  over  the  life  cycle  of  the  system. 
It  is  important  that  the  system  planner  consider  not  only  the  cost  to 
implement  a  reliability  program,  but  also  the  cost  of  not  implementing 
the program. A knowledge of these considerations can aid the system
planner in assessing the effects of reliability on the costs of
ownership [SIEW82].


6.2.1     Cost  Of  Not  Implementing  A  Reliability  Program

As  the  organization  becomes  increasingly  dependent  on  its 
computer  systems,  the  impact  of  a  failure  on  the  organization  needs  to 
be  examined  and  evaluated.  Interruption  of  service  by  any  fully 
utilized  system  will  eventually  lead  to  a  loss  of  money  or  time.  It 
is  not  possible  to  generalize  the  cost  of  failing  to  implement  a 
reliability  program  since  it  is  dependent  on  the  system  applications 
and  the  frequency  with  which  the  system  fails.  However,  the  greater 
the  application's  dependence  on  the  computer  system,  the  higher  the 
cost  of  downtime.     These  costs  are  reflected  by: 

o  a disruption or delay in production, development, and schedules,

o  loss or corruption of information (data and programs),

o  an increase in maintenance costs,

o  an increase in acquisition costs of spare (replacement) parts,

o  a decrease in user productivity and confidence in the system.
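
Although the cost cannot be generalized, a planner can make a rough
site-specific estimate of the cost of downtime. The sketch below is
illustrative only: the simple product formula and all input values are
assumptions, and it ignores costs that are hard to price, such as lost
user confidence.

```python
def annual_downtime_cost(failures_per_year, mean_outage_hours, cost_per_outage_hour):
    """Rough yearly cost of service interruptions (assumed simple product)."""
    return failures_per_year * mean_outage_hours * cost_per_outage_hour


# Same failure pattern, two hypothetical levels of dependence on the system.
low_dependence = annual_downtime_cost(12, 2.0, 500.0)
high_dependence = annual_downtime_cost(12, 2.0, 5000.0)
print(low_dependence, high_dependence)
```

The comparison shows the point made above: for the same failure
frequency, the application that depends more heavily on the system
bears the higher cost of downtime.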


6.2.2    Cost  Of  Implementing  A  Reliability  Program

Associated with the elements of a reliability program is the cost
to implement and maintain those elements. The costs may be either
one-time expenses or expenses that recur over the operational life of
the system. Despite these costs, the deployment of a reliability
program and its resulting reliability improvements will yield
reductions in future operation and maintenance expenditures. The costs
are reflected in the following reliability program elements and
activities. (Further explanation of these elements can be found in
previous sections of this guide.)

o  reliability specifications in RFP (design techniques, reliability
measures, controls, and thresholds),

o  site preparation (alternate power sources and communication
lines),

o  redundancy of critical subsystems,

o  hardware and software monitors,

o  auditing and analysis software,

o  auditing, analysis, and refinement of the reliability program,

o  routine  maintenance  program  (preventive  maintenance), 

o  spare  parts  inventory, 

o  trained  support  personnel  (operators,  analysts,  technicians), 

o  duplication  and  storage  of  software  (programs  and  data). 
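
One way to weigh these costs against the cost of not implementing the
program is to spread the one-time expenses over the system life and
compare the yearly total with the downtime cost avoided. The following
is a minimal sketch: it assumes straight-line amortization with no
discounting, and every figure is hypothetical.

```python
def annualized_program_cost(one_time_costs, recurring_costs_per_year, system_life_years):
    """Yearly program cost: one-time expenses spread evenly over the
    system life (an assumed straight-line amortization) plus recurring costs."""
    return sum(one_time_costs) / system_life_years + sum(recurring_costs_per_year)


def net_annual_benefit(downtime_cost_without, downtime_cost_with, program_cost_per_year):
    """Downtime cost avoided each year, less the yearly program cost."""
    return (downtime_cost_without - downtime_cost_with) - program_cost_per_year


# Hypothetical figures: redundancy and site preparation are one-time
# costs; preventive maintenance and training recur each year.
program = annualized_program_cost([40000, 10000], [8000, 2000], system_life_years=10)
net = net_annual_benefit(60000, 20000, program)
print(program, net)
```

A positive result suggests the program pays for itself under the stated
assumptions. A negative result does not by itself argue against the
program, since benefits such as user confidence are not priced here.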


6.3     Activities  For  Establishing  And  Maintaining  Reliability 

The  successful  evolution  of  a  reliable  computing  system  requires 
several  important  management  decisions  and  actions.  Outlined  below 
are  the  major  activities  in  the  establishment  and  maintenance  of  a 
reliability  program. 

1.     Establish  Reliability  Goals: 

o  Determine  the  probability  of  a  failure  and  its  impact  on  the 
system. 

o  Determine how much should be spent on reliability concerns
(remember, reliability affects other life cycle costs, e.g.,
maintenance).

o  Determine  and  integrate  reliability  concerns  with  overall 
system  objectives. 


2.     Consider  alternate  ways  of  achieving  reliability  goals: 

o    Consider  the  various  design  and  implementation  techniques. 

o  Determine the feasibility of implementing the targeted
reliability techniques.

o  Consider  all  options.  For  example,  to  provide  backup 
computing  ability,  weigh  the  advantages  of  implementing  a 
redundant  computer  system  vs.     buying  time-sharing  services. 


3.     Select  Appropriate  Measures  and  Controls: 

o    Include  controls  that  provide    early    warning    of  reliability 
problems. 

o  Incorporate measures that can provide information on the
performance objectives of the system.

o  Include several complementary and overlapping measures in
order to achieve realistic and complete reliability
information.

o  Establish an appropriate schedule (frequency) for collecting
and assessing reliability data.
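
As an illustration of complementary measures combined with an early
warning control, the sketch below computes three commonly used
measures from a failure log: mean time between failures (MTBF), mean
time to repair (MTTR), and availability, taken here as MTBF / (MTBF +
MTTR). The log values and threshold levels are hypothetical, and the
function names are illustrative only.

```python
def reliability_measures(uptime_hours, repair_hours):
    """uptime_hours: operating hours preceding each failure.
    repair_hours: hours needed to repair each failure."""
    mtbf = sum(uptime_hours) / len(uptime_hours)
    mttr = sum(repair_hours) / len(repair_hours)
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


def early_warnings(mtbf, mttr, availability, min_mtbf, max_mttr, min_availability):
    """Names of the measures that have crossed their thresholds."""
    warnings = []
    if mtbf < min_mtbf:
        warnings.append("MTBF")
    if mttr > max_mttr:
        warnings.append("MTTR")
    if availability < min_availability:
        warnings.append("availability")
    return warnings


# Hypothetical log for one collection period and assumed thresholds.
mtbf, mttr, availability = reliability_measures([200, 150, 250], [2, 4, 6])
warnings = early_warnings(mtbf, mttr, availability,
                          min_mtbf=180, max_mttr=3, min_availability=0.99)
print(mtbf, mttr, round(availability, 4), warnings)
```

In this example the MTBF alone looks acceptable, while the repair time
and availability reveal a problem. This is one reason for collecting
several overlapping measures.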


4.     Establish clear contracts with system vendors:

o    Alert  internal  procurement  personnel  to  reliability  needs. 

o    Define  reliability  requirements  clearly  and  in  detail. 

o  Amplify  requirements  and  tasks  in  RFP  statement  of  work, 
technical  specifications,  data  requirements  list,  data  item 
descriptions,  etc. 

o  Identify  the  responsible  agent  for  each  requirement  and/or 
product  (including  groups  or  personnel  internal  to  your 
organization).

o  Specify  penalties  and  contingency  plans  for  failure  to  meet 
performance  standards. 


5.     Define acceptance criteria:

o    Specify  levels  of  acceptable    computing    performance    for  the 
system  and  its  subsystems. 

o    Define  threshold  levels  and  criteria  for  reliability  measures. 


6.     Develop maintenance strategy:

o  Provide for remedial maintenance to correct any problems on a
timely basis.

o  Determine optimum schedule and scope of preventive
maintenance.

o    Determine  if  stockpiling  of  spare    parts    is  cost-beneficial. 
If  so,  determine  the  type  and  quantity  of  equipment  to  store. 


7.     Monitor the system:

o  Implement quality control techniques to retard the
deterioration of the system.

o  Process and evaluate the reliability information.

o  Plan and conduct periodic reviews of the system and adjust
accordingly. Account for system aging and wear-out (Figure
12) and initiate change when more reliable system components
are available and cost-effective.
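
The effect of aging on failure rate can be illustrated with a simple
bathtub-shaped model. This is a sketch only: the three-term form and
all parameter values are assumptions chosen to produce the familiar
shape, not data from any real system.

```python
import math


def bathtub_failure_rate(t_years, early=0.05, early_decay=0.5,
                         steady=0.01, wear=0.0005, wear_onset=8.0):
    """Illustrative failure rate as a function of system age.

    Sum of a decaying early-failure term, a constant term for the
    useful life, and a wear-out term that grows after wear_onset.
    All parameters are hypothetical.
    """
    early_failures = early * math.exp(-t_years / early_decay)
    wear_out = wear * max(0.0, t_years - wear_onset) ** 2
    return early_failures + steady + wear_out


for age in (0, 4, 12):
    print(age, round(bathtub_failure_rate(age), 4))
```

Under these assumptions the rate is highest for a new system, settles
to a roughly constant level during its useful life, and rises again
once wear-out begins. Periodic reviews can use observed failure data
to judge where on such a curve the system lies.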


As the system gets older, more failures occur due to wear-out of the
components. The time to repair increases because of the difficulty
in obtaining replacement parts and knowledgeable repair personnel.


Figure 12: Bathtub curve - Failure rate as a function of time